Overview

Dataset statistics

Number of variables: 15
Number of observations: 501
Missing cells: 655
Missing cells (%): 8.7%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 58.8 KiB
Average record size in memory: 120.3 B
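
The missing-cell percentage follows from the counts reported above: the table has 15 x 501 cells, of which 655 are missing. A minimal sketch of that arithmetic (the variable names are illustrative, not part of the report):

```python
# Counts taken from the dataset statistics above.
n_variables = 15
n_observations = 501
missing_cells = 655

# Every variable contributes one cell per observation.
total_cells = n_variables * n_observations

# Share of cells that are missing, as a percentage.
missing_pct = 100 * missing_cells / total_cells
print(f"{total_cells} cells, {missing_pct:.1f}% missing")
```

Note that the column Popuation [2001] alone accounts for 501 of the 655 missing cells.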

Variable types

NUM: 10
CAT: 4
UNSUPPORTED: 1

Warnings

High cardinality: City has a high cardinality: 497 distinct values
High correlation: Female Population is highly correlated with Population [2011]
High correlation: Population [2011] is highly correlated with Female Population
Missing: Population [2011] has 6 (1.2%) missing values
Missing: Popuation [2001] has 501 (100.0%) missing values
Missing: Median Age has 13 (2.6%) missing values
Missing: Avg Temp has 14 (2.8%) missing values
Missing: SWM has 9 (1.8%) missing values
Missing: Toilets Avl has 22 (4.4%) missing values
Missing: Water Purity has 19 (3.8%) missing values
Missing: H Index has 15 (3.0%) missing values
Missing: Female Population has 15 (3.0%) missing values
Missing: # of hospitals has 17 (3.4%) missing values
Missing: Foreign Visitors has 17 (3.4%) missing values
Uniform: City is uniformly distributed
Unsupported: Popuation [2001] is an unsupported type; check if it needs cleaning or further analysis

Reproduction

Analysis started: 2020-09-14 22:16:40.553515
Analysis finished: 2020-09-14 22:17:00.796732
Duration: 20.24 seconds
Software version: pandas-profiling v2.9.0
Download configuration: config.yaml

Variables

City
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 497
Distinct (%): 99.2%
Missing: 0
Missing (%): 0.0%
Memory size: 3.9 KiB

Value | Count | Frequency (%)
Shahpura | 2 | 0.4%
Sumerpur | 2 | 0.4%
Pratapgarh | 2 | 0.4%
Narsinghgarh | 2 | 0.4%
Rau | 1 | 0.2%
Kodungallur | 1 | 0.2%
Purwa | 1 | 0.2%
Reengus | 1 | 0.2%
Puthuppally | 1 | 0.2%
Todabhim | 1 | 0.2%
Other values (487) | 487 | 97.2%

Frequencies of value counts

Unique

Unique: 493
Unique (%): 98.4%

Histogram of lengths of the category

Length

Max length: 24
Median length: 8
Mean length: 8.546906188
Min length: 3

State
Categorical

Distinct: 29
Distinct (%): 5.8%
Missing: 0
Missing (%): 0.0%
Memory size: 3.9 KiB

Value | Count | Frequency (%)
Uttar Pradesh | 56 | 11.2%
Uttarakhand | 51 | 10.2%
Tamil Nadu | 51 | 10.2%
Maharashtra | 51 | 10.2%
Rajasthan | 48 | 9.6%
Karnataka | 38 | 7.6%
Madhya Pradesh | 37 | 7.4%
Bihar | 22 | 4.4%
Gujarat | 22 | 4.4%
Kerala | 18 | 3.6%
Other values (19) | 107 | 21.4%

Frequencies of value counts

Unique

Unique: 5
Unique (%): 1.0%

Histogram of lengths of the category

Length

Max length: 22
Median length: 10
Mean length: 10.04191617
Min length: 5

Type
Categorical

Distinct: 32
Distinct (%): 6.4%
Missing: 2
Missing (%): 0.4%
Memory size: 3.9 KiB

Value | Count | Frequency (%)
M | 119 | 23.8%
N.P | 86 | 17.2%
M.Cl | 61 | 12.2%
T.P | 42 | 8.4%
C.T | 41 | 8.2%
T.M.C | 22 | 4.4%
N.P.P | 19 | 3.8%
N.A | 19 | 3.8%
M.B | 16 | 3.2%
UA | 8 | 1.6%
Other values (22) | 66 | 13.2%

Frequencies of value counts

Unique

Unique: 10
Unique (%): 2.0%

Histogram of lengths of the category

Length

Max length: 6
Median length: 3
Mean length: 2.932135729
Min length: 1

Population [2011]
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct: 486
Distinct (%): 98.2%
Missing: 6
Missing (%): 1.2%
Infinite: 0
Infinite (%): 0.0%
Mean: 24747.46869
Minimum: 110
Maximum: 36774
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 110
5-th percentile: 7577.7
Q1: 21435
Median: 25199
Q3: 30763
95-th percentile: 35414.3
Maximum: 36774
Range: 36664
Interquartile range (IQR): 9328

Descriptive statistics

Standard deviation: 7813.0675
Coefficient of variation (CV): 0.3157117845
Kurtosis: 0.7106377163
Mean: 24747.46869
Median Absolute Deviation (MAD): 4344
Skewness: -0.9115545052
Sum: 12249997
Variance: 61044023.76
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
23234 | 3 | 0.6%
23456 | 2 | 0.4%
27815 | 2 | 0.4%
22781 | 2 | 0.4%
22516 | 2 | 0.4%
21643 | 2 | 0.4%
6309 | 2 | 0.4%
23331 | 2 | 0.4%
36172 | 1 | 0.2%
26521 | 1 | 0.2%
Other values (476) | 476 | 95.0%
(Missing) | 6 | 1.2%

Minimum 5 values

Value | Count | Frequency (%)
110 | 1 | 0.2%
612 | 1 | 0.2%
1517 | 1 | 0.2%
1641 | 1 | 0.2%
2152 | 1 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
36774 | 1 | 0.2%
36754 | 1 | 0.2%
36732 | 1 | 0.2%
36706 | 1 | 0.2%
36669 | 1 | 0.2%
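
Several of the derived statistics reported for Population [2011] can be checked directly from the other reported values: the range is maximum minus minimum, the IQR is Q3 minus Q1, the coefficient of variation is the standard deviation divided by the mean, and the variance is the squared standard deviation. A small sketch of those relationships, using only numbers quoted in this section (the variable names are illustrative):

```python
# Values quoted in the Population [2011] section.
minimum, maximum = 110, 36774
q1, q3 = 21435, 30763
mean = 24747.46869
std = 7813.0675

value_range = maximum - minimum   # spread between the extremes
iqr = q3 - q1                     # spread of the middle 50% of values
cv = std / mean                   # dispersion relative to the mean
variance = std ** 2               # squared standard deviation

print(value_range, iqr, round(cv, 4), round(variance))
```

The same relationships hold for every numeric variable in the report, so they are a quick sanity check when transcribing values.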
 

Popuation [2001]
Unsupported

MISSING
REJECTED
UNSUPPORTED

Missing: 501
Missing (%): 100.0%
Memory size: 4.0 KiB

Sex Ratio
Real number (ℝ≥0)

Distinct: 145
Distinct (%): 29.2%
Missing: 5
Missing (%): 1.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 895.5080645
Minimum: 774
Maximum: 991
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 774
5-th percentile: 839.5
Q1: 867.75
Median: 890.5
Q3: 922
95-th percentile: 963
Maximum: 991
Range: 217
Interquartile range (IQR): 54.25

Descriptive statistics

Standard deviation: 38.46415011
Coefficient of variation (CV): 0.0429523213
Kurtosis: -0.5558492085
Mean: 895.5080645
Median Absolute Deviation (MAD): 27.5
Skewness: 0.2749158367
Sum: 444172
Variance: 1479.490844
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
869 | 10 | 2.0%
910 | 10 | 2.0%
868 | 9 | 1.8%
875 | 9 | 1.8%
874 | 8 | 1.6%
870 | 8 | 1.6%
919 | 8 | 1.6%
877 | 7 | 1.4%
876 | 7 | 1.4%
863 | 7 | 1.4%
Other values (135) | 413 | 82.4%

Minimum 5 values

Value | Count | Frequency (%)
774 | 1 | 0.2%
823 | 1 | 0.2%
827 | 1 | 0.2%
831 | 5 | 1.0%
832 | 1 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
991 | 1 | 0.2%
986 | 1 | 0.2%
985 | 1 | 0.2%
983 | 1 | 0.2%
982 | 1 | 0.2%
 

Median Age
Real number (ℝ≥0)

MISSING

Distinct: 10
Distinct (%): 2.0%
Missing: 13
Missing (%): 2.6%
Infinite: 0
Infinite (%): 0.0%
Mean: 26.12090164
Minimum: 23
Maximum: 32
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 23
5-th percentile: 23
Q1: 24
Median: 26
Q3: 28
95-th percentile: 29
Maximum: 32
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.145558807
Coefficient of variation (CV): 0.08213953854
Kurtosis: -0.8697617891
Mean: 26.12090164
Median Absolute Deviation (MAD): 2
Skewness: 0.2156424531
Sum: 12747
Variance: 4.603422594
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
24 | 75 | 15.0%
29 | 74 | 14.8%
25 | 71 | 14.2%
26 | 70 | 14.0%
28 | 68 | 13.6%
23 | 64 | 12.8%
27 | 54 | 10.8%
30 | 5 | 1.0%
32 | 5 | 1.0%
31 | 2 | 0.4%
(Missing) | 13 | 2.6%

Minimum 5 values

Value | Count | Frequency (%)
23 | 64 | 12.8%
24 | 75 | 15.0%
25 | 71 | 14.2%
26 | 70 | 14.0%
27 | 54 | 10.8%

Maximum 5 values

Value | Count | Frequency (%)
32 | 5 | 1.0%
31 | 2 | 0.4%
30 | 5 | 1.0%
29 | 74 | 14.8%
28 | 68 | 13.6%
 

Avg Temp
Real number (ℝ≥0)

MISSING

Distinct: 27
Distinct (%): 5.5%
Missing: 14
Missing (%): 2.8%
Infinite: 0
Infinite (%): 0.0%
Mean: 29.10061602
Minimum: 5
Maximum: 40
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 5
5-th percentile: 9
Q1: 26
Median: 31
Q3: 36
95-th percentile: 39
Maximum: 40
Range: 35
Interquartile range (IQR): 10

Descriptive statistics

Standard deviation: 9.295787556
Coefficient of variation (CV): 0.3194361092
Kurtosis: 0.3458727314
Mean: 29.10061602
Median Absolute Deviation (MAD): 5
Skewness: -1.139441722
Sum: 14172
Variance: 86.41166629
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=27)

Common values

Value | Count | Frequency (%)
35 | 38 | 7.6%
34 | 37 | 7.4%
38 | 32 | 6.4%
26 | 30 | 6.0%
39 | 30 | 6.0%
25 | 25 | 5.0%
37 | 25 | 5.0%
31 | 23 | 4.6%
28 | 23 | 4.6%
29 | 22 | 4.4%
Other values (17) | 202 | 40.3%

Minimum 5 values

Value | Count | Frequency (%)
5 | 7 | 1.4%
6 | 3 | 0.6%
7 | 5 | 1.0%
8 | 9 | 1.8%
9 | 6 | 1.2%

Maximum 5 values

Value | Count | Frequency (%)
40 | 20 | 4.0%
39 | 30 | 6.0%
38 | 32 | 6.4%
37 | 25 | 5.0%
36 | 19 | 3.8%
 

SWM
Categorical

MISSING

Distinct: 3
Distinct (%): 0.6%
Missing: 9
Missing (%): 1.8%
Memory size: 3.9 KiB

Value | Count | Frequency (%)
LOW | 179 | 35.7%
HIGH | 158 | 31.5%
MEDIUM | 155 | 30.9%
(Missing) | 9 | 1.8%

Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%

Histogram of lengths of the category

Length

Max length: 6
Median length: 4
Mean length: 4.243512974
Min length: 3

Toilets Avl
Real number (ℝ≥0)

MISSING

Distinct: 62
Distinct (%): 12.9%
Missing: 22
Missing (%): 4.4%
Infinite: 0
Infinite (%): 0.0%
Mean: 72.2776618
Minimum: 10
Maximum: 100
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 10
5-th percentile: 18.9
Q1: 61
Median: 74
Q3: 90
95-th percentile: 98
Maximum: 100
Range: 90
Interquartile range (IQR): 29

Descriptive statistics

Standard deviation: 20.79900178
Coefficient of variation (CV): 0.2877652826
Kurtosis: 1.131275088
Mean: 72.2776618
Median Absolute Deviation (MAD): 15
Skewness: -1.039213286
Sum: 34621
Variance: 432.5984749
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
69 | 17 | 3.4%
97 | 16 | 3.2%
92 | 15 | 3.0%
94 | 15 | 3.0%
96 | 13 | 2.6%
71 | 13 | 2.6%
50 | 12 | 2.4%
95 | 12 | 2.4%
80 | 12 | 2.4%
57 | 12 | 2.4%
Other values (52) | 342 | 68.3%
(Missing) | 22 | 4.4%

Minimum 5 values

Value | Count | Frequency (%)
10 | 2 | 0.4%
11 | 2 | 0.4%
12 | 4 | 0.8%
13 | 1 | 0.2%
14 | 3 | 0.6%

Maximum 5 values

Value | Count | Frequency (%)
100 | 9 | 1.8%
99 | 9 | 1.8%
98 | 7 | 1.4%
97 | 16 | 3.2%
96 | 13 | 2.6%
 

Water Purity
Real number (ℝ≥0)

MISSING

Distinct: 99
Distinct (%): 20.5%
Missing: 19
Missing (%): 3.8%
Infinite: 0
Infinite (%): 0.0%
Mean: 151.3589212
Minimum: 100
Maximum: 200
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 100
5-th percentile: 105
Q1: 127
Median: 152
Q3: 175
95-th percentile: 194.95
Maximum: 200
Range: 100
Interquartile range (IQR): 48

Descriptive statistics

Standard deviation: 28.71919055
Coefficient of variation (CV): 0.1897423048
Kurtosis: -1.168462088
Mean: 151.3589212
Median Absolute Deviation (MAD): 24.5
Skewness: -0.09407192221
Sum: 72955
Variance: 824.7919057
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
114 | 10 | 2.0%
136 | 10 | 2.0%
144 | 10 | 2.0%
124 | 9 | 1.8%
110 | 9 | 1.8%
166 | 9 | 1.8%
146 | 9 | 1.8%
178 | 9 | 1.8%
160 | 9 | 1.8%
174 | 8 | 1.6%
Other values (89) | 390 | 77.8%
(Missing) | 19 | 3.8%

Minimum 5 values

Value | Count | Frequency (%)
100 | 7 | 1.4%
101 | 2 | 0.4%
102 | 2 | 0.4%
103 | 4 | 0.8%
104 | 5 | 1.0%

Maximum 5 values

Value | Count | Frequency (%)
200 | 4 | 0.8%
199 | 4 | 0.8%
198 | 8 | 1.6%
197 | 6 | 1.2%
195 | 3 | 0.6%
 

H Index
Real number (ℝ≥0)

MISSING

Distinct: 486
Distinct (%): 100.0%
Missing: 15
Missing (%): 3.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.5010416349
Minimum: 0.0009574363038
Maximum: 0.9999010903
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 0.0009574363038
5-th percentile: 0.062014457
Q1: 0.2666187181
Median: 0.5082177095
Q3: 0.737776037
95-th percentile: 0.944058368
Maximum: 0.9999010903
Range: 0.998943654
Interquartile range (IQR): 0.4711573189

Descriptive statistics

Standard deviation: 0.2843004523
Coefficient of variation (CV): 0.5674188182
Kurtosis: -1.138861144
Mean: 0.5010416349
Median Absolute Deviation (MAD): 0.2379787621
Skewness: 0.004863596619
Sum: 243.5062345
Variance: 0.08082674719
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
0.8867904331 | 1 | 0.2%
0.1746321156 | 1 | 0.2%
0.3189381151 | 1 | 0.2%
0.403033013 | 1 | 0.2%
0.2414900063 | 1 | 0.2%
0.7442951681 | 1 | 0.2%
0.08814416623 | 1 | 0.2%
0.8996637671 | 1 | 0.2%
0.630238193 | 1 | 0.2%
0.63898617 | 1 | 0.2%
Other values (476) | 476 | 95.0%
(Missing) | 15 | 3.0%

Minimum 5 values

Value | Count | Frequency (%)
0.0009574363038 | 1 | 0.2%
0.002001209807 | 1 | 0.2%
0.003805058303 | 1 | 0.2%
0.004338165469 | 1 | 0.2%
0.005412633668 | 1 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
0.9999010903 | 1 | 0.2%
0.997322788 | 1 | 0.2%
0.9961231447 | 1 | 0.2%
0.9952600462 | 1 | 0.2%
0.9919635495 | 1 | 0.2%
 

Female Population
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct: 482
Distinct (%): 99.2%
Missing: 15
Missing (%): 3.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 22542.63374
Minimum: 0
Maximum: 34523
Zeros: 1
Zeros (%): 0.2%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 0
5-th percentile: 7698.5
Q1: 19449.75
Median: 22998.5
Q3: 27701.75
95-th percentile: 31957.5
Maximum: 34523
Range: 34523
Interquartile range (IQR): 8252

Descriptive statistics

Standard deviation: 6931.232314
Coefficient of variation (CV): 0.3074721611
Kurtosis: 1.044417715
Mean: 22542.63374
Median Absolute Deviation (MAD): 3994.5
Skewness: -0.9312305215
Sum: 10955720
Variance: 48041981.4
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value | Count | Frequency (%)
19004 | 2 | 0.4%
21857 | 2 | 0.4%
19316 | 2 | 0.4%
27362 | 2 | 0.4%
27600 | 1 | 0.2%
5382 | 1 | 0.2%
19529 | 1 | 0.2%
8878 | 1 | 0.2%
19336 | 1 | 0.2%
30221 | 1 | 0.2%
Other values (472) | 472 | 94.2%
(Missing) | 15 | 3.0%

Minimum 5 values

Value | Count | Frequency (%)
0 | 1 | 0.2%
94 | 1 | 0.2%
522 | 1 | 0.2%
1292 | 1 | 0.2%
1392 | 1 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
34523 | 1 | 0.2%
34360 | 1 | 0.2%
34328 | 1 | 0.2%
34237 | 1 | 0.2%
34114 | 1 | 0.2%
 

# of hospitals
Real number (ℝ≥0)

MISSING

Distinct: 27
Distinct (%): 5.6%
Missing: 17
Missing (%): 3.4%
Infinite: 0
Infinite (%): 0.0%
Mean: 19.17355372
Minimum: 3
Maximum: 30
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 3
5-th percentile: 7.15
Q1: 14
Median: 20
Q3: 25
95-th percentile: 29
Maximum: 30
Range: 27
Interquartile range (IQR): 11

Descriptive statistics

Standard deviation: 6.69714897
Coefficient of variation (CV): 0.3492909592
Kurtosis: -0.6789259198
Mean: 19.17355372
Median Absolute Deviation (MAD): 5
Skewness: -0.3042408893
Sum: 9280
Variance: 44.85180432
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=27)

Common values

Value | Count | Frequency (%)
21 | 33 | 6.6%
28 | 29 | 5.8%
12 | 26 | 5.2%
20 | 25 | 5.0%
23 | 24 | 4.8%
19 | 24 | 4.8%
15 | 23 | 4.6%
24 | 22 | 4.4%
11 | 21 | 4.2%
26 | 21 | 4.2%
Other values (17) | 236 | 47.1%

Minimum 5 values

Value | Count | Frequency (%)
3 | 3 | 0.6%
4 | 9 | 1.8%
5 | 6 | 1.2%
6 | 5 | 1.0%
7 | 2 | 0.4%

Maximum 5 values

Value | Count | Frequency (%)
30 | 16 | 3.2%
29 | 19 | 3.8%
28 | 29 | 5.8%
27 | 17 | 3.4%
26 | 21 | 4.2%
 

Foreign Visitors
Real number (ℝ≥0)

MISSING

Distinct: 28
Distinct (%): 5.8%
Missing: 17
Missing (%): 3.4%
Infinite: 0
Infinite (%): 0.0%
Mean: 1676300.903
Minimum: 798
Maximum: 4684707
Zeros: 0
Zeros (%): 0.0%
Memory size: 3.9 KiB

Quantile statistics

Minimum: 798
5-th percentile: 34886
Q1: 284973
Median: 923737
Q3: 3104060
95-th percentile: 4684707
Maximum: 4684707
Range: 4683909
Interquartile range (IQR): 2819087

Descriptive statistics

Standard deviation: 1704860.432
Coefficient of variation (CV): 1.017037233
Kurtosis: -1.041338164
Mean: 1676300.903
Median Absolute Deviation (MAD): 755952
Skewness: 0.7625311928
Sum: 811329637
Variance: 2.906549094e+12
Monotonicity: Not monotonic

Histogram with fixed size bins (bins=28)

Common values

Value | Count | Frequency (%)
3104060 | 56 | 11.2%
4684707 | 49 | 9.8%
4408916 | 49 | 9.8%
1475311 | 47 | 9.4%
105882 | 45 | 9.0%
636502 | 37 | 7.4%
421365 | 35 | 7.0%
923737 | 22 | 4.4%
284973 | 22 | 4.4%
977479 | 18 | 3.6%
Other values (18) | 104 | 20.8%
(Missing) | 17 | 3.4%

Minimum 5 values

Value | Count | Frequency (%)
798 | 1 | 0.2%
1797 | 1 | 0.2%
2769 | 3 | 0.6%
3260 | 2 | 0.4%
5705 | 2 | 0.4%

Maximum 5 values

Value | Count | Frequency (%)
4684707 | 49 | 9.8%
4408916 | 49 | 9.8%
3104060 | 56 | 11.2%
1489500 | 14 | 2.8%
1475311 | 47 | 9.4%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
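
As an illustration of that definition, here is a minimal pure-Python sketch that divides the sample covariance by the product of the sample standard deviations. The function name `pearson_r` is an assumption made for this example; the input values are the Population [2011] and Female Population figures from the first five sample rows of this report.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and standard deviations (n - 1 in the denominator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)

# First five sample rows: Population [2011] and Female Population.
population = [36774, 36754, 36732, 36706, 36669]
female = [34237, 34328, 32434, 32558, 32159]
print(pearson_r(population, female))
```

Because the n - 1 factors cancel, using population rather than sample estimates would give the same r. Five rows are far too few to reproduce the report's correlation warning; the example only shows the mechanics.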

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
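
The calculation can be sketched as two steps: replace each value by its rank, then apply the covariance-over-standard-deviations formula to the ranks. The helper names below are assumptions for this example, and giving tied values their average rank is a common convention rather than something stated in the report.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# A nonlinear but strictly increasing relationship still gives rho = 1.
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))
```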

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
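
A direct sketch of that pair-counting definition follows. It implements the plain form described in the text (often called tau-a); statistical libraries frequently report a tie-adjusted variant instead, so results may differ when the data contain ties. The function name is an assumption for this example.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both variables move in the same direction
        elif sign < 0:
            discordant += 1      # the variables move in opposite directions
        # sign == 0 means a tie in x or y; the pair counts toward neither
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# One swapped pair out of six: 5 concordant, 1 discordant.
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```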

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. The report uses the bias-corrected measure proposed by Bergsma in 2013.
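
For orientation, the following is a sketch of the bias correction as it is commonly stated: the chi-squared statistic divided by n is reduced by its expected value under independence, and the table dimensions are shrunk accordingly. This is an assumption about the general form of Bergsma's correction, not a copy of the report generator's code, and the function name is illustrative.

```python
from collections import Counter
from math import sqrt

def cramers_v_corrected(a, b):
    """Bias-corrected Cramér's V for two equally long sequences of category labels."""
    n = len(a)
    joint = Counter(zip(a, b))
    rows = Counter(a)
    cols = Counter(b)

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for r_label, r_count in rows.items():
        for c_label, c_count in cols.items():
            expected = r_count * c_count / n
            observed = joint.get((r_label, c_label), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    r, k = len(rows), len(cols)
    # Subtract the bias term and shrink the effective table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0

# Identical label sequences are perfectly associated.
labels = ["LOW", "HIGH"] * 10
print(cramers_v_corrected(labels, labels))
```

The `denom > 0` guard covers degenerate inputs such as a column with a single category, where the measure is undefined; returning 0.0 there is a choice made for this sketch.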

Missing values

Sample

First rows

Index | City | State | Type | Population [2011] | Popuation [2001] | Sex Ratio | Median Age | Avg Temp | SWM | Toilets Avl | Water Purity | H Index | Female Population | # of hospitals | Foreign Visitors
0 | Tuensang | Nagaland | T.C | 36774.0 | NaN | 931.0 | 23.0 | 10.0 | MEDIUM | 94.0 | 114.0 | 0.253390 | 34237.0 | 17.0 | 2769.0
1 | Lakshmeshwar | Karnataka | T.M.C | 36754.0 | NaN | 934.0 | 25.0 | 38.0 | HIGH | 62.0 | 160.0 | 0.192555 | 34328.0 | 13.0 | 636502.0
2 | Zira | Punjab | M.Cl. | 36732.0 | NaN | 883.0 | 29.0 | 35.0 | HIGH | 63.0 | 105.0 | 0.887882 | 32434.0 | 17.0 | 242367.0
3 | Yawal | Maharashtra | M.Cl | 36706.0 | NaN | 887.0 | 26.0 | 31.0 | HIGH | 60.0 | 174.0 | 0.407838 | 32558.0 | 11.0 | 4408916.0
4 | Thana Bhawan | Uttar Pradesh | N.P. | 36669.0 | NaN | 877.0 | 28.0 | 39.0 | LOW | 92.0 | 153.0 | 0.324456 | 32159.0 | 23.0 | 3104060.0
5 | Ramdurg | Karnataka | UA | 36649.0 | NaN | 942.0 | 27.0 | 28.0 | MEDIUM | 92.0 | 185.0 | 0.571883 | 34523.0 | 30.0 | 636502.0
6 | Pulgaon | Maharashtra | M.Cl | 36522.0 | NaN | 887.0 | 26.0 | 31.0 | MEDIUM | 72.0 | 108.0 | 0.271195 | 32395.0 | 11.0 | 4408916.0
7 | Sadasivpet | Telangana | M | 36334.0 | NaN | 921.0 | 27.0 | 40.0 | LOW | 70.0 | 116.0 | 0.494227 | 33464.0 | 17.0 | 126078.0
8 | Nargund | Karnataka | T.M.C | 36291.0 | NaN | 940.0 | 23.0 | 37.0 | LOW | 77.0 | 148.0 | 0.708562 | 34114.0 | 21.0 | 636502.0
9 | Neem-Ka-Thana | Rajasthan | M | 36231.0 | NaN | 850.0 | 25.0 | 25.0 | MEDIUM | 61.0 | 148.0 | 0.592325 | 30796.0 | 29.0 | 1475311.0

Last rows

Index | City | State | Type | Population [2011] | Popuation [2001] | Sex Ratio | Median Age | Avg Temp | SWM | Toilets Avl | Water Purity | H Index | Female Population | # of hospitals | Foreign Visitors
491 | Bhaiseena | Rajasthan | G.P | 3200.0 | NaN | 869.0 | 24.0 | 34.0 | LOW | 17.0 | 167.0 | 0.092957 | 2781.0 | 4.0 | 1475311.0
492 | Dwarahat | Uttarakhand | N.P | 2749.0 | NaN | 836.0 | 25.0 | 12.0 | HIGH | 18.0 | 146.0 | 0.186739 | 2298.0 | 8.0 | 105882.0
493 | Badrinath | Uttarakhand | N.P | 2438.0 | NaN | 848.0 | 29.0 | 12.0 | LOW | 19.0 | 190.0 | 0.432991 | 2067.0 | 4.0 | 105882.0
494 | Dogadda | Uttarakhand | N.P.P | 2422.0 | NaN | 840.0 | 26.0 | 11.0 | HIGH | 11.0 | 146.0 | 0.030421 | 2034.0 | 4.0 | 105882.0
495 | Devprayag | Uttarakhand | N.P | 2152.0 | NaN | 840.0 | 29.0 | 7.0 | MEDIUM | 14.0 | 124.0 | 0.503070 | 1808.0 | 8.0 | 105882.0
496 | Nandaprayag | Uttarakhand | N.P | 1641.0 | NaN | 848.0 | 27.0 | 7.0 | MEDIUM | 12.0 | 181.0 | 0.316926 | 1392.0 | 4.0 | 105882.0
497 | Kirtinagar | Uttarakhand | N.P | 1517.0 | NaN | 852.0 | 28.0 | 12.0 | HIGH | 16.0 | 198.0 | 0.336852 | 1292.0 | 6.0 | 105882.0
498 | Kedarnath | Uttarakhand | N.P | 612.0 | NaN | 853.0 | 24.0 | 9.0 | LOW | 19.0 | 189.0 | 0.723253 | 522.0 | 6.0 | 105882.0
499 | Gangotri | Uttarakhand | N.P | 110.0 | NaN | 852.0 | 27.0 | 8.0 | MEDIUM | 18.0 | 170.0 | 0.421061 | 94.0 | 8.0 | 105882.0
500 | Kumarganj | Uttar Pradesh | C.T | NaN | NaN | 863.0 | 24.0 | 35.0 | HIGH | 19.0 | 149.0 | 0.154375 | 0.0 | 6.0 | 3104060.0